
Data Quality Issues



Does Homophily Help in Robust Test-time Node Classification?

Jiang, Yan, Qiu, Ruihong, Huang, Zi

arXiv.org Artificial Intelligence

Homophily, the tendency of nodes from the same class to connect, is a fundamental property of real-world graphs, underpinning structural and semantic patterns in domains such as citation networks and social networks. Existing methods exploit homophily by designing homophily-aware GNN architectures or graph structure learning strategies, yet they primarily focus on GNN learning with training graphs. In real-world scenarios, however, test graphs often suffer from data quality issues and distribution shifts, such as domain shifts across users from different regions in social networks and temporal evolution shifts in citation graphs collected over varying time periods. These factors significantly compromise a pre-trained model's robustness, resulting in degraded test-time performance. Through empirical observations and theoretical analysis, we reveal that transforming the test graph structure, by increasing homophily in homophilic graphs or decreasing it in heterophilic graphs, can significantly improve the robustness and performance of pre-trained GNNs on node classification, without requiring any model training or updates. Motivated by these insights, we propose GrapHoST, a novel test-time graph structural transformation method grounded in homophily. Specifically, a homophily predictor is developed to discriminate test edges, enabling adaptive test-time graph structural transformation guided by the confidence of predicted homophily scores. Extensive experiments on nine benchmark datasets under a range of test-time data quality issues demonstrate that GrapHoST consistently achieves state-of-the-art performance, with improvements of up to 10.92%. Our code has been released at https://github.com/YanJiangJerry/GrapHoST.
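The abstract describes the mechanism only at a high level, so the following is a minimal NumPy illustration of the underlying principle rather than GrapHoST itself: measure edge homophily, then drop test edges whose predicted homophily scores confidently disagree with the graph's regime. The stand-in predictor output, thresholds, and function names are all illustrative assumptions.

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share a class label."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

def transform_graph(edge_index, homophily_scores, graph_is_homophilic,
                    low=0.1, high=0.9):
    """Drop test edges whose predicted homophily confidently disagrees
    with the graph's regime; low-confidence edges are left untouched."""
    if graph_is_homophilic:
        keep = homophily_scores > low    # remove confidently heterophilic edges
    else:
        keep = homophily_scores < high   # remove confidently homophilic edges
    return edge_index[:, keep]

# Toy test graph: 6 nodes, 2 classes, one noisy cross-class edge.
labels = np.array([0, 0, 0, 1, 1, 1])
edge_index = np.array([[0, 1, 2, 0, 3, 4],
                       [1, 2, 0, 3, 4, 5]])
scores = np.array([0.95, 0.90, 0.92, 0.05, 0.88, 0.91])  # stand-in predictor output

print(edge_homophily(edge_index, labels))                                  # ~0.833
print(edge_homophily(transform_graph(edge_index, scores, True), labels))   # 1.0
```

Pruning the single confidently heterophilic edge raises the toy graph's homophily from 0.83 to 1.0 without touching the model, which mirrors the training-free transformation the abstract argues for.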



CleanPatrick: A Benchmark for Image Data Cleaning

Gröger, Fabian, Lionetti, Simone, Gottfrois, Philippe, Gonzalez-Jimenez, Alvaro, Amruthalingam, Ludovic, Goessinger, Elisabeth Victoria, Lindemann, Hanna, Bargiela, Marie, Hofbauer, Marie, Badri, Omar, Tschandl, Philipp, Koochek, Arash, Groh, Matthew, Navarini, Alexander A., Pouly, Marc

arXiv.org Artificial Intelligence

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
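Since CleanPatrick scores issue detection as a ranking problem, the evaluation pattern can be illustrated independently of the benchmark. The sketch below uses synthetic stand-in labels and scores (the benchmark's exact metric suite is an assumption here) to compute average precision and precision under a fixed review budget.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def precision_at_k(y_true, scores, k):
    """Precision among the k highest-scoring samples (the review budget)."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(y_true[top_k]))

# Synthetic stand-ins: binary issue labels and detector scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)      # 1 = confirmed issue
scores = y_true * 0.5 + rng.random(1000)    # a detector better than chance

print("AP:   ", average_precision_score(y_true, scores))
print("P@100:", precision_at_k(y_true, scores, k=100))
```

Precision at a fixed k mirrors the "constrained review budget" setting the abstract mentions: an auditor only ever inspects the top of the ranking.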


Formative Study for AI-assisted Data Visualization

Saber, Rania, Fariha, Anna

arXiv.org Artificial Intelligence

This formative study investigates the impact of data quality on AI-assisted data visualization, focusing on how uncleaned datasets influence the outcomes of these tools. By generating visualizations from datasets with inherent quality issues, the research aims to identify and categorize the specific visualization problems that arise. The study further explores potential methods and tools to address these visualization challenges efficiently and effectively. Although tool development has not yet been undertaken, the findings emphasize the need to enhance AI visualization tools to handle flawed data better. This research underscores the critical need for more robust, user-friendly solutions that make correcting data and visualization errors quicker and easier, thereby improving the overall reliability and usability of AI-assisted data visualization.


Towards Reliable Dermatology Evaluation Benchmarks

Gröger, Fabian, Lionetti, Simone, Gottfrois, Philippe, Gonzalez-Jimenez, Alvaro, Groh, Matthew, Daneshjou, Roxana, Consortium, Labelling, Navarini, Alexander A., Pouly, Marc

arXiv.org Artificial Intelligence

Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation. The protocol leverages an existing algorithmic cleaning strategy and is followed by a confirmation process terminated by an intuitive stopping criterion. Based on confirmation by multiple dermatologists, we remove irrelevant samples and near-duplicates, and we estimate the percentage of label errors in six dermatology image datasets promoted for model evaluation by the International Skin Imaging Collaboration. Along with this paper, we publish revised file lists for each dataset, which should be used for model evaluation. Our work paves the way for more trustworthy performance assessment in digital dermatology.
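The abstract does not spell out the stopping criterion, so the sketch below assumes a simple patience rule: experts review algorithmically ranked candidates and stop after a fixed run of consecutive rejections. The `is_issue` callable stands in for dermatologist confirmation and is hypothetical.

```python
def confirm_ranked_candidates(candidates, is_issue, patience=20):
    """Review candidates in descending issue-score order; stop once
    `patience` consecutive candidates have been rejected."""
    confirmed, misses = [], 0
    for sample in candidates:
        if is_issue(sample):        # expert confirmation (hypothetical hook)
            confirmed.append(sample)
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break
    return confirmed
```

A patience rule of this kind keeps review effort roughly proportional to how many true issues sit near the top of the ranking, which is one way to realize the resource efficiency the paper targets.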


Quality Issues in Machine Learning Software Systems

Côté, Pierre-Olivier, Nikanjam, Amin, Bouchoucha, Rached, Basta, Ilan, Abidi, Mouna, Khomh, Foutse

arXiv.org Artificial Intelligence

Context: An increasing demand is observed across domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Problem: There is a strong need to ensure the serving quality of MLSSs. False or poor decisions by such systems can lead to the malfunction of other systems, significant financial losses, or even threats to human life. The quality assurance of MLSSs is considered a challenging task and is currently a hot research topic. Objective: This empirical study investigates the characteristics of real quality issues in MLSSs from the viewpoint of practitioners and aims to identify a catalog of such issues. Method: We conduct a set of interviews with practitioners and experts to gather insights about their experience and practices when dealing with quality issues, and we validate the identified issues via a survey of ML practitioners. Results: Based on the content of 37 interviews, we identified 18 recurring quality issues and 24 strategies to mitigate them. For each identified issue, we describe its causes and consequences according to the practitioners' experience. Conclusion: We believe the catalog of issues developed in this study will allow the community to develop efficient quality assurance tools for ML models and MLSSs. A replication package of our study is available in our public GitHub repository.


Is Your Data Quality Enough to Support Machine Learning/AI Plans?

#artificialintelligence

AI is a priority for governments and businesses worldwide, yet poor data quality is a key aspect of AI that has been overlooked. AI algorithms depend on reliable data to produce optimal results; if the data is incomplete, incorrect, or insufficient, the consequences can be devastating. In AI systems that identify patients' diseases, for example, poor data quality can produce inaccurate diagnoses and predictions, leading to misdiagnosis and delayed treatment.


Interactive data prep widget for notebooks powered by Amazon SageMaker Data Wrangler

#artificialintelligence

According to a 2020 survey of data scientists conducted by Anaconda, data preparation is one of the most critical steps in machine learning (ML) and data analytics workflows, and often very time-consuming. Data scientists spend about 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), and visualizing (21%) data. Amazon SageMaker Studio is the first fully integrated development environment (IDE) for ML. With a single click, data scientists and developers can spin up Studio notebooks to explore datasets and build models. If you prefer a GUI-based, interactive interface, you can use Amazon SageMaker Data Wrangler, which offers over 300 built-in visualizations, analyses, and transformations to efficiently process data backed by Spark without writing a single line of code.
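As a concrete illustration of the notebook workflow the post describes: to the best of my recollection the widget is activated by installing the sagemaker-datawrangler package and simply displaying a pandas DataFrame in a Studio notebook, but treat the package name and behavior as assumptions here, and the S3 path as a placeholder.

```python
# Run inside a SageMaker Studio notebook; assumes the widget package is
# available, e.g. via: %pip install sagemaker-datawrangler
import pandas as pd
import sagemaker_datawrangler  # noqa: F401 -- importing activates the widget

# Placeholder path: substitute your own dataset location.
df = pd.read_csv("s3://example-bucket/raw/customers.csv")

# Displaying the DataFrame renders the interactive prep widget, which
# suggests transforms (drop duplicates, fill missing values, etc.) and can
# export the applied steps back as reusable pandas code.
df
```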


Council Post: Data Quality Is Also An AI Problem

#artificialintelligence

Emanuel Younanzadeh is VP Marketing at The Modern Data Company. Artificial intelligence (AI) continues its rise to prominence in the business world. The number of companies using AI and the range of problems AI is being applied to are both increasing steadily. However, one issue plagues AI just as much as it has plagued analytics of all kinds over the years: data quality. Organizations put tremendous resources behind ensuring the quality of their data.